1. Introduction

RPC2 is a remote procedure call mechanism that has been used extensively in the Andrew distributed
computing environment at Carnegie Mellon University [12]. A detailed description of RPC2 may be found
elsewhere [16, 15]. MultiRPC is an extension to RPC2 that enables a client to perform remote invocations of
multiple servers while retaining the reliability characteristics of remote procedure calls. In this paper we
describe MultiRPC and show how we have made it fast, versatile and simple to use.

Section 2 presents an overview of RPC2. Section 3 explains why we extended RPC2 with a parallel
invocation mechanism. The considerations that influenced the design of MultiRPC are put forth in Section 4.
Section 5 describes the design and implementation of MultiRPC, explores some of the subtle consequences of
our original design decisions, and motivates the resulting modifications. Section 6 discusses the experimental
evaluation of the system. An analytic model is derived and validated by comparing its predictions to the
results of controlled experiments. Section 7 relates this work to other efforts on parallelism in
network communication. Section 8 concludes the paper with an overview of work in progress.

2.  Overview  of  RPC2 

RPC2 consists of two relatively independent components: a Unix-based runtime library written in C, and a
stub generator, RP2Gen. The runtime system is self-contained and is usable in the absence of RP2Gen. The
code in the stubs generated by RP2Gen is, however, specific to RPC2.

A  subsystem  is  a  set  of  related  remote  procedure  calls  that  make  up  a  remote  interface.  RP2Gen  takes  a 
description  of  a  subsystem  and  automatically  generates  code  to  marshall  and  unmarshall  parameters  in  the 
client and server stubs. It thus performs a function similar to Lupine in the Xerox RPC mechanism [1] and
Matchmaker  in  Accent  IPC  [10]. 

The RPC2 runtime system is fully integrated with a Lightweight Process mechanism (LWP) [13] that supports
multiple nonpreemptive threads of control within a single Unix process. When a remote procedure is
invoked, the calling LWP is suspended until the call is complete. Other LWPs in the same Unix process are,
however, still runnable. The LWP package allows independent threads of control to share virtual memory, a
feature that is not present in standard Unix. Both RPC2 and the LWP package are entirely outside the Unix
kernel and have been ported to multiple machine types.

The low-level packet¹ transport mechanism is a separable component of RPC2. The only primitives required
of it are the ability to send and receive datagrams. At present, RPC2 runs on the DARPA IP/UDP
protocol [6, 7].

RPC2  is  based  on  logical  connections.  The  rationale  for  choosing  a  connection-based  rather  than 
connectionless  protocol  is  presented  in  the  design  document  [16].  For  the  purposes  of  this  paper  the  following 
facts about RPC2 connections are relevant:

1. A connection is created when a client invokes the BIND primitive and is destroyed by the UNBIND
primitive. The cost of a bind is comparable to the cost of a normal RPC.

2. One can view BIND as a special RPC that is common to all subsystems. In fact, a server is notified
of  the  creation  of  a  new  connection  in  exactly  the  same  way  it  is  notified  of  an  RPC  on  an  existing 
connection. 

¹Throughout this paper we use the term "packet" to mean a logical packet. In some networks a packet may be physically transmitted as
multiple fragments. Such fragmentation is transparent to RPC2.


3. Connections use little storage. Typically a connection requires a hundred bytes at each of the
client and server ends. No other resources are used by a connection.

4. Within each Unix process, an RPC2 connection is identified by a unique handle. Handles are
never reused during the life of a process.

5.  At  any  given  time  a  Unix  process  can  have  at  most  64K  active  connections.  This  is  about  two 
orders  of  magnitude  larger  than  the  number  of  connections  in  use  in  the  most  heavily  loaded 
Andrew  servers. 

Although  one  speaks  of  “clients”  and  “servers”,  it  should  be  noted  that  the  mechanism  is  completely 
symmetric.  A  server  can  be  a  client  to  many  other  servers,  and  a  client  may  be  the  server  to  many  other 
clients.  On  a  given  connection,  however,  the  roles  of  the  peers  are  fixed. 

Besides BIND and UNBIND, the most important runtime primitive on the client side is MAKERPC. This call
sends a request packet to a server and then waits for a reply. Reliable delivery is guaranteed by a
retransmission protocol built on top of the datagram transport mechanism. Calls may take an arbitrary length
of time; in response to client retries the server sends keep-alive packets (BUSY packets) to indicate that it is
still alive and connected to the network. On the server side, the basic primitives are GETREQUEST, which
blocks until a request is received, and SENDRESPONSE, which sends out a reply packet. RPC2 provides
exactly-once semantics in the absence of site and hard network failures, and at-most-once semantics
otherwise [17].
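The client side of this exchange can be sketched as a small simulation. The function name, the retry limit and the rule that a BUSY packet resets the count of unanswered transmissions are assumptions made for illustration; the text states only that the server answers client retries with BUSY packets to show that it is still alive.

```python
def make_rpc(observations, max_retries=3):
    """Simulate the client side of one call (hypothetical helper).

    observations: what the client sees after each (re)transmission:
    "REPLY", "BUSY", or None when the retransmission timer expires
    with nothing received.
    Returns (status, number_of_transmissions).
    """
    unanswered = 0
    sends = 0
    for seen in observations:
        sends += 1
        if seen == "REPLY":
            return "ok", sends
        if seen == "BUSY":
            unanswered = 0          # assumption: BUSY proves the server is alive
        else:
            unanswered += 1
            if unanswered > max_retries:
                return "dead", sends  # declare the connection broken
    return "pending", sends
```

A long-running call that is answered with BUSY packets therefore never reaches the retry limit, whereas a silent server is declared dead after a bounded number of retransmissions.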

A  unique  aspect  of  RPC2  is  its  support  of  arbitrary  side  effects  on  RPC  calls.  The  side  effect  mechanism 
allows  application-specific  protocols  to  be  integrated  with  the  base  RPC2  code.  Side  effects  and  a  number  of 
other RPC2 features are discussed in the design document. Tables 2-1, 2-2, and 2-3 summarise the RPC2
primitives  relevant  to  this  paper. 


Primitive        Description

BIND             Create a new connection
MAKERPC          Make a remote procedure call
MULTIRPC         Make a collection of remote procedure calls

Table 2-1: Client Primitives


Primitive        Description

EXPORT           Indicate willingness to accept calls for a subsystem
DEEXPORT         Stop accepting new connections for one or all subsystems
GETREQUEST       Wait for an RPC request or a new connection
ENABLE           Allow servicing of requests on a new connection
SENDRESPONSE     Respond to a request from a client
INITSIDEEFFECT   Initiate side effect
CHECKSIDEEFFECT  Check progress of side effect

Table 2-2: Server Primitives


Primitive        Description

INIT             Perform runtime system initialisation
UNBIND           Terminate a connection by client or server
ALLOCBUFFER      Allocate a packet buffer
FREEBUFFER       Free a packet buffer

Table 2-3: Miscellaneous Primitives


3.  Motivation 

The principles underlying MultiRPC arose as a solution to a specific problem in Andrew. In the Andrew file
system  [14],  workstations  fetch  files  from  servers  and  cache  them  on  their  local  disks.  In  order  to  maintain  the 
consistency  of  the  caches,  servers  maintain  callback  state  about  the  files  cached  by  workstations.  A  callback 
on  a  file  is  essentially  a  commitment  by  a  server  to  a  workstation  that  it  will  notify  the  latter  of  any  change  to 
the  file.  This  guarantee  maintains  consistency  while  allowing  workstations  to  use  cached  data  without 
contacting  the  server  on  each  access.  Before  a  file  may  be  modified  on  the  server,  every  workstation  that  has  a 
callback  on  the  file  must  be  notified.  Since  the  system  is  ultimately  expected  to  encompass  over  5000 
workstations,  an  update  to  a  popular  file  may  involve  a  callback  RPC  to  hundreds  or  thousands  of 
workstations.  The  problem  is  exacerbated  by  the  fact  that  a  callback  RPC  to  a  dead  or  unreachable 
workstation  must  time  out  before  the  connection  is  declared  broken  and  the  next  workstation  tried.  Each 
such  workstation  would  cause  a  delay  of  many  seconds,  rather  than  the  few  tens  of  milliseconds  typical  of 
RPC  roundtrip  times  for  simple  requests.  Given  these  observations,  we  felt  that  the  potential  delay  in 
updating  widely-cached  files  would  be  unacceptable  if  we  were  restricted  to  using  simple  RPC  calls 
iteratively. 

A  simple  broadcast  of  callback  information  is  not  feasible.  With  broadcast,  every  time  a  file  is  changed 
anywhere  in  the  system  every  workstation  would  have  to  process  a  callback  packet  and  determine  if  the  packet 
were  relevant  to  that  workstation.  Using  multicast  to  narrow  the  set  of  workstations  contacted  is  also 
impractical,  because  each  file  would  then  potentially  have  to  correspond  to  a  distinct  multicast  address.  Since 
workstations flush and replace cache entries frequently, the membership of multicast groups would be highly
dynamic  and  difficult  to  maintain  in  a  consistent  manner. 

Besides these considerations, the use of broadcast or multicast does not provide servers with confirmation that
individual workstations have indeed received the callback information. Such confirmation is implicit in the
reliable  delivery  semantics  of  RPC.  It  became  clear  to  us  that  we  needed  a  mechanism  that  retained  strict 
RPC  semantics  while  overlapping  the  computation  and  communication  overheads  at  each  of  the  destinations. 
This  is  the  essence  of  MultiRPC. 

MultiRPC  has  applications  in  other  contexts  too.  Replication  algorithms  such  as  quorum  consensus  [9] 
require  multiple  network  sites  to  be  contacted  in  order  to  perform  an  operation.  The  request  to  each  site  is 
usually  the  same,  although  the  returned  information  may  be  different.  MultiRPC  could  be  used  to 
considerably enhance the performance of such algorithms. The performance of some relatively simple but
frequent  operations  in  large  distributed  systems  may  also  be  improved  by  MultiRPC.  Consider,  for  example, 
the  contacting  of  a  name  or  time  server.  If  more  than  one  such  server  is  available,  it  may  be  reasonable  to  use 
MultiRPC  to  contact  many  of  them,  wait  for  the  earliest  reply  and  abandon  all  further  replies. 


4.  Design  Considerations 

The primary consideration in the design of MultiRPC was that it be inexpensive. We did not want normal
RPC calls to be slowed down because of MultiRPC. Although one-to-many RPC calls constituted a very
important special case, we expected simple, one-to-one RPC calls to be preponderant. A related, but distinct,
concern was the increase in program size resulting from MultiRPC. Since virtual memory usage in our
workstations was already high, we wished to keep MultiRPC small.

Another  influence  on  our  design  was  the  desire  to  decouple  the  design  of  subsystems  from  considerations 
relating  to  MultiRPC.  We  did  not  want  to  require  any  changes  to  clients  who  used  only  RPC2,  or  to  servers. 
Our  view  was  that  only  clients  who  wished  to  access  multiple  sites  in  parallel  should  have  to  know  about 
MultiRPC. 

Since  we  insisted  on  allowing  simple  RPC  and  MultiRPC  calls  in  any  order  on  any  combination  of 
connections, MultiRPC had to be completely orthogonal to normal RPC2 features. The delivery semantics,
failure  detection,  support  for  multiple  security  levels  and  the  ability  to  use  side  effects  all  had  to  be  retained 
when  making  a  MultiRPC  call. 

A  number  of  the  scenarios  in  which  we  envisaged  MultiRPC  being  used  required  replies  to  be  processed  by 
the client as they arrived rather than being batched. Since the exact nature of such processing was application
dependent, it had to be performed by a client-specified procedure. In addition, we felt that it was important
for  a  client  to  be  able  to  abort  the  MultiRPC  call  either  after  examining  any  reply  or  after  a  specified  amount 
of  time  had  elapsed  since  the  start  of  the  MultiRPC  call. 

Finally, we wanted MultiRPC to be simple to use. We have been successful in this even though the syntax of
a  MultiRPC  call  is  different  from  the  syntax  of  a  simple  RPC  call,  the  latter  being  similar  to  a  local  procedure 
call.  We  have  had  to  violate  this  syntax  for  two  reasons:  to  allow  clients  to  specify  an  arbitrary  reply-handling 
procedure  in  a  MultiRPC  call,  and  to  avoid  expanding  code  size  by  generating  a  MultiRPC  stub  for  every  call 
in  a  subsystem. 

5.  Design  and  Implementation 

In  this  section  we  first  present  the  design  of  MultiRPC.  We  discuss  certain  reliability  and  performance 
problems  revealed  by  a  prototype  implementation  and  then  describe  the  refinements  made  to  alleviate  these 
deficiencies. 


5.1. Overall Structure

Support for MultiRPC is present at both the runtime level and the language level, reflecting the organisation
of RPC2. The runtime interface can be used independent of the language interface, but not vice versa.

Runtime support is provided by the routine MULTIRPC that takes a request packet, a list of connections and a
client handler routine as input and blocks until all responses have been received or until the call is explicitly
terminated by the client handler. The packet which is transmitted to a server is identical to a packet generated
by an RPC2 call. A server is not even aware that it is participating in a MultiRPC call.

MultiRPC  provides  the  same  correctness  guarantees  as  RPC2,  except  when  the  client  terminates  a  call 
prematurely.  In  this  case,  a  success  return  code  indicates  that  no  connection  failures  were  detected  prior  to 
the point of termination. However, undetected server failures may have occurred after termination.

Language support for MultiRPC is provided by the routines MAKEMULTI and UNPACKMULTI. These routines
interpret templates called argument descriptor structures (ARGs) generated by RP2Gen to perform the
packing and unpacking of parameters in request and reply packets. The decision to interpret ARGs at
runtime rather than to use precompiled stubs as in RPC2 was motivated by storage size considerations. The
slight additional processing cost of parameter interpretation is outweighed by the savings in the code size. A
consequence of this is that the syntax of a MultiRPC call no longer resembles invocation of a local procedure.
Each component of a MultiRPC call does, however, retain the semantics of an equivalent RPC2 call.

Appendix I describes the external interface of MultiRPC using an example. It presents a simple RPC2
subsystem in Figure I-1 and shows typical client and server code written by a user for non-MultiRPC calls in
Figures I-3 and I-4. RP2Gen uses the subsystem definition to generate a header file (Figure I-2) as well as
client and server stub files (not shown).

Figure I-5 shows how the user has to modify the client code to use MultiRPC. There are only two significant
changes: the direct call to the client stub has to be replaced by an indirect call via MAKEMULTI, and a client
handler routine has to be provided to process replies.

The ARGs used by MAKEMULTI are defined by RP2Gen in the header file. Each routine in a subsystem has
an associated array of ARGs, with the type, usage and size of each parameter being specified by one array
element. Structures are described by an array of ARGs, one ARG per field. Nested structures are described
by correspondingly nested ARGs. At runtime, MAKEMULTI traverses the ARG array and actual parameter list
in step.
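The lock-step traversal of descriptors and actual parameters can be illustrated with a minimal sketch. The descriptor fields (`mode`, `type`, `size`), the two type names and the wire encoding are assumptions chosen for illustration; they are not the layout that RP2Gen actually emits, and nested structures are omitted.

```python
import struct
from dataclasses import dataclass

@dataclass
class Arg:
    """Hypothetical argument descriptor: one per formal parameter."""
    mode: str   # usage: "IN", "OUT" or "IN_OUT"
    type: str   # illustrative type names: "int" or "bytes"
    size: int   # size in bytes; checked only for "bytes" here

def pack_request(args, params):
    """Walk the descriptor array and the actual parameters in step,
    appending every IN or IN_OUT parameter to the request body."""
    if len(args) != len(params):
        raise ValueError("descriptor/parameter count mismatch")
    body = bytearray()
    for arg, value in zip(args, params):
        if arg.mode == "OUT":
            continue                          # pure outputs are not sent
        if arg.type == "int":
            body += struct.pack("!i", value)  # 32-bit, network byte order
        elif arg.type == "bytes":
            if len(value) != arg.size:
                raise ValueError("wrong size for fixed-length field")
            body += value
        else:
            raise ValueError(f"unknown type {arg.type!r}")
    return bytes(body)

descriptors = [Arg("IN", "int", 4), Arg("IN", "bytes", 3), Arg("OUT", "int", 4)]
packet = pack_request(descriptors, [7, b"abc", None])
```

Because one interpreter serves every routine in every subsystem, no per-routine packing code is needed, which is the code-size saving that motivated interpreting descriptors at runtime.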

The client handler is activated exactly once for each connection specified in the MultiRPC call. Each
activation corresponds to the receipt of a reply or to detection of a permanent failure on that connection. The
handler enables these events to be processed as soon as they occur. Its return code indicates whether the
MultiRPC call should be continued or terminated.
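The handler contract described above can be sketched as a small simulation in which events are replayed in arrival order. The names `multi_call`, `CONTINUE` and `TERMINATE`, and the use of `None` to represent a permanent failure, are hypothetical and do not reflect the actual RPC2 interface.

```python
CONTINUE, TERMINATE = 0, 1   # hypothetical handler return codes

def multi_call(conn_ids, events, handler):
    """Deliver each simulated event to the handler as it occurs.

    events: (conn_id, outcome) pairs in arrival order, where outcome is
    a reply payload or None for a permanent failure.
    Returns the connections still outstanding when the call returns.
    """
    outstanding = set(conn_ids)
    for conn_id, outcome in events:
        if conn_id not in outstanding:
            continue                     # late or duplicate event: ignore
        outstanding.discard(conn_id)     # at most one activation per connection
        if handler(conn_id, outcome) == TERMINATE:
            break                        # client ended the call early
        if not outstanding:
            break                        # every connection has been handled
    return sorted(outstanding)

seen = []

def first_reply_wins(conn_id, outcome):
    """Stop as soon as any server has answered successfully."""
    seen.append((conn_id, outcome))
    return TERMINATE if outcome is not None else CONTINUE

left = multi_call([1, 2, 3], [(2, None), (3, "ok"), (1, "ok")], first_reply_wins)
```

In this run the handler sees the failure on connection 2, then the reply on connection 3, and terminates the call while connection 1 is still outstanding.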

The internal routine SENDPACKETSRELIABLY is the heart of the MultiRPC retransmission, failure detection
and result gathering mechanism. It performs an initial transmission of requests on all relevant connections
and then awaits replies or timeouts. Each timeout on a connection causes a retransmission of the request to
the corresponding server. On a reply, appropriate side effect processing is performed and then
UNPACKMULTI is invoked. Client-specified timeouts are handled in this routine.


5.2.  Handling  Failures 

Two factors complicate the semantics of failures in MultiRPC. First, since multiple connections are involved,
how does one treat failures on an individual connection? Should the entire call be declared a failure and
aborted at that point? In our design this decision is delegated to the client handler. This routine is called on
each failure and the return code from it specifies whether the call should be terminated. This allows
applications to use a variety of strategies, such as termination on a single failure or termination beyond a
threshold of failures. If an error such as an attempt to use a dead connection is detected during initial
processing of a MultiRPC request, packet transmission is suppressed on that connection.
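The two strategies mentioned above differ only in the threshold used, as the following sketch suggests. The return codes and the convention that a failure is reported to the handler as `None` are assumptions made for illustration.

```python
CONTINUE, TERMINATE = 0, 1   # hypothetical handler return codes

def make_threshold_handler(max_failures):
    """Build a client handler that tolerates up to max_failures failed
    connections and terminates the call on the next failure."""
    failures = 0

    def handler(conn_id, reply):
        nonlocal failures
        if reply is None:                # assumption: None marks a failure
            failures += 1
        return TERMINATE if failures > max_failures else CONTINUE

    return handler

strict = make_threshold_handler(0)       # terminate on a single failure
tolerant = make_threshold_handler(1)     # terminate on the second failure
```

Keeping the policy in the handler means the runtime needs no knowledge of how many failures a particular application can tolerate.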

The second source of complexity arises from the fact that the client handler can terminate a MultiRPC call
before all replies are received. What is the state of the connections on which replies have not been received?
Should these connections be monitored for failure after the call is terminated? How are the outstanding
replies dealt with if they do arrive eventually? Our strategy is to pretend that a response has actually been
received on each of the outstanding connections. After termination of a MultiRPC call, MultiRPC increments
the sequence number and resets the state on each such connection. Responses that do eventually arrive are
ignored, and new failures will not be detected until the next MultiRPC or simple RPC2 call.
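A minimal sketch of this "pretend the reply arrived" strategy follows. The class, its fields and the increment of one per call are assumptions made for illustration; the actual sequence numbering used by RPC2 is not specified in this section.

```python
class Connection:
    """Hypothetical client-side connection state."""

    def __init__(self):
        self.seq = 0                 # sequence number of the next request
        self.awaiting_reply = False

    def send_request(self):
        self.awaiting_reply = True
        return self.seq

    def abandon(self):
        """Called when the call is terminated early: behave as if the
        outstanding reply had been received."""
        self.seq += 1
        self.awaiting_reply = False

    def accept_reply(self, reply_seq):
        """Return True if the reply matches the outstanding request."""
        if self.awaiting_reply and reply_seq == self.seq:
            self.seq += 1
            self.awaiting_reply = False
            return True
        return False                 # stale or unexpected reply: ignored

conn = Connection()
old = conn.send_request()
conn.abandon()                       # handler terminated the call early
late_accepted = conn.accept_reply(old)
new = conn.send_request()
```

The late reply to the abandoned request no longer matches the connection state and is dropped, while the next request proceeds with a fresh sequence number.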

This ability to terminate a MultiRPC call prematurely interacts with an orthogonal aspect of RPC2 to produce
a race condition. Originally, the RPC2 protocol required the client to send an acknowledgement to the server
when a reply was received. The server would retry the reply until it received the acknowledgement or until it
timed out. Suppose a client were to terminate a MultiRPC call prematurely and then immediately make
another MultiRPC call. Then deadlock could arise on each connection on which a reply was outstanding
when the first call was terminated. The retried replies by the server on that connection would be ignored by
the client. Similarly, the server would ignore the new request from the client. Only a server or client timeout
could end the deadlock. This problem would be compounded if the client terminated the second call
prematurely, and then continued with further MultiRPC calls. The client could continue indefinitely in this
mode without realising that the connection was functionally dead.

Our solution to this problem is to send an explicit negative acknowledgement if a packet with a sequence
number higher than expected is received. This enables both the client and the server to immediately detect
the failure mode described above. Because the RPC2 protocol no longer requires replies to be acknowledged,
this fix is now superfluous. However, we retain it to allow prompt identification of connections that have
been marked unusable by a server for other reasons such as side-effect failures.
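The receiver-side rule can be sketched as a three-way classification. The function name and the string results are hypothetical, and silently ignoring a packet whose sequence number is lower than expected is an assumption; the text specifies only the negative acknowledgement for a higher-than-expected number.

```python
def classify_packet(expected_seq, packet_seq):
    """Decide what a receiver does with an incoming packet."""
    if packet_seq == expected_seq:
        return "accept"    # the packet the receiver was waiting for
    if packet_seq > expected_seq:
        return "nak"       # peer has moved ahead: reject explicitly
    return "ignore"        # assumption: stale or duplicate packet
```

An explicit rejection lets the sender learn at once that the connection is out of step, instead of waiting for a timeout.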

Another possible failure mode relates to the client handler routine. During an excessively long computation
in this routine, the internal buffers in Unix will be filled with incoming replies and further replies will be lost.
This has the effect of increasing retransmissions and hence degrading performance. In addition, logical errors
can arise if the client handler is not reentrant but yields control. This can happen, for instance, if the client
handler makes an RPC during its processing. Writing a client handler is thus, in many ways, similar to writing
an interrupt handler in an operating system.

5.3.  Evolution 

Experience with an initial prototype of MultiRPC led us to make a number of changes pertaining to function
and performance. The changes relating to function have been mentioned in Section 5.2. In this section we
describe the changes that we made to improve the performance of MultiRPC.

Early trials of MultiRPC showed a surprisingly large number of retried packets, even when the number of
servers being contacted was relatively small. Careful examination of the code showed that most of these
packets were not being lost, but were being discarded after receipt. It turned out that the low-level RPC2
code first checked for timed-out events and then checked for packet arrivals. For a MultiRPC call to many
sites, the total time to transmit all requests exceeded the first retransmission interval. Replies from the first
few servers were discarded because they corresponded to events that had timed out. To fix this problem we
now time out events only after receiving all packets that have arrived.
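The corrected ordering can be sketched as a single pass of an event loop. The function name and the data structures (a list of arrived connection identifiers and a dictionary of retransmission deadlines) are assumptions made for illustration and do not mirror the RPC2 internals.

```python
def poll_once(now, arrived, pending):
    """One pass of a hypothetical event loop.

    arrived: connection ids whose reply packets are already queued.
    pending: dict mapping connection id -> retransmission deadline.
    Returns (delivered, retransmit) lists; pending is updated in place.
    """
    delivered = []
    for conn_id in arrived:              # step 1: drain everything received
        if conn_id in pending:
            del pending[conn_id]
            delivered.append(conn_id)
    # step 2: only now look for deadlines that have expired
    retransmit = sorted(c for c, deadline in pending.items() if deadline <= now)
    return delivered, retransmit

pending = {1: 5.0, 2: 5.0, 3: 9.0}
delivered, retransmit = poll_once(6.0, [1], pending)
```

Connection 1 has passed its deadline, but because its reply is processed first it is delivered rather than retransmitted; only connection 2, which has no queued reply, is retried.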

Another change was made to the same piece of low-level code to reduce the number of LWP context switches.
Rather than yield control on each received packet, the code now yields control only after all available packets
have been received. In MultiRPC this reduces context switches because all these packets are destined for the
same client LWP. This is in contrast to the situation in simple RPC2 calls where the semantics of RPC
guarantees that a client LWP can be waiting for at most one packet.

A third change addressed the fact that Unix provides only a limited amount of buffering in the kernel for
incoming packets. For a sufficiently large number of servers in a MultiRPC call, enough replies could arrive
[...] and process these packets. The additional processing further slowed the client and caused it to lose new
replies, thus leading to an unstable mode of operation. For this reason, as well as the failure mode described
in Section 5.2 and other reasons independent of MultiRPC, we have changed the reliable transmission
protocol to no longer acknowledge replies.

The current implementation of MultiRPC incorporates all the modifications described in this section and a
number of other minor changes. The performance measurements described in Section 6 were obtained with
this implementation.

6.  Performance 

The performance measure that best characterises MultiRPC is the ratio of the elapsed time for using RPC2
iteratively (r), to the elapsed time for using MultiRPC (m). This ratio (r/m), as a function of the number of
sites contacted (n), is the speedup realised by a MultiRPC implementation. Although linear speedup is clearly
desirable, MultiRPC can be valuable even with modest speedup. In the application which motivated
MultiRPC, for instance, rapid failure detection was of much greater concern than speedup of processing. This
benefit of using MultiRPC exists even if there is no speedup of processing.

In this section we assess the speedup of MultiRPC in three steps. We first present an analytic model in
Section 6.1, validate this model using data from controlled experiments in Section 6.2.2, and then present, in
Section 6.2.3, data from large-scale experiments where the assumptions behind our model are violated. The
raw data for the measurements and the analytic model predictions are found in Appendix III, and are
presented graphically in Appendix IV.


6.1. Analytic Model

Our goal in this section is to derive an analytic model that can predict the behaviour of MultiRPC. Although
the simplifying assumptions we make may not strictly hold in practice, they are acceptable for the level of
accuracy we are trying to achieve.

The most important assumption deals with network topology and latency. In most distributed systems, the
actual transit time on the network is a small fraction of the processing time spent in sending and receiving a
packet. Routers or other interconnecting elements on a multi-segment network can, however, increase latency
considerably. For the purposes of our model, we assume that the client and all the servers are on a
single-segment network that has negligible latency.

A  second  assumption  relates  to  mutual  interference  and  loss  of  packets.  Although  MultiRPC  is  built  on 
unreliable  datagrams,  the  actual  probability  of  packet  loss  is  quite  low,  typically  below  1  percent.  However,  as 
more  servers  are  contacted,  the  probability  of  packet  loss  increases  because  of  limited  buffering  capability  at 
the client. Even if packets are not lost, race conditions between the client and the servers can cause packet
retransmissions.  We  ignore  all  these  complications  and  assume  that  there  are  no  lost  or  retried  packets  during 
a  MultiRPC  call. 

Finally,  we  assume  that  each  server  takes  a  constant  amount  of  time  to  service  a  request  and  that  this  time  is 
uniform  across  all  servers.  This  assumption  is  valid  to  a  first  approximation  even  though  the  specific  nature  of 
the  request,  the  presence  of  other  processing  activity  at  the  servers,  and  slight  differences  in  hardware 
performance  can  result  in  nonuniform  service  times. 

A  MultiRPC  call  can  be  decomposed  into  the  following  components: 

pack     Packing of arguments by client.

cloh     Protocol and kernel processing by client to send request.

servoh   Protocol and kernel processing by server to receive request and send reply.

clproc   Protocol and kernel processing by client to receive reply.

unpack   Unpacking of arguments and processing in client handler.

In terms of the MultiRPC implementation described in Section 5.1, pack is the time taken by the routine
MAKEMULTI, cloh corresponds to the time in MULTIRPC and the initial part of SENDPACKETSRELIABLY, clproc
corresponds to the remainder of SENDPACKETSRELIABLY, and unpack is the time taken by UNPACKMULTI and
a call to a null client handler routine. Since MultiRPC and simple RPC2 calls are indistinguishable at the
servers, servoh is the same for both iterative RPC2 calls and MultiRPC. This is the total time taken to receive
a request and to send a reply, assuming zero processing time. We include a separate term comptime to
account for the application processing at a server.

The pack component is performed only once, regardless of the number of servers being contacted. The servoh
and comptime components overlap at the servers. All the other components have to be performed once for each
server. In terms of these components, the total time m for a MultiRPC call to n sites can be expressed as

$$m = \mathit{pack} + (n \times \mathit{cloh}) + (\mathit{servoh} + \mathit{comptime}) + (n \times \mathit{clproc}) + (n \times \mathit{unpack})$$

Unfortunately this expression contains an oversimplification that affects the model predictions significantly.
Suppose waittime is the elapsed time between the sending of the last request and the receipt of the first reply
by the client. For a single server, waittime will be the sum of servoh and comptime. For a large enough
number of servers, however, the reply from the first server may be available before the last request is sent out;
in this case waittime is zero.


Assuming that the time to send a packet is sendtime, the expression for m can be refined as follows:

$$\mathit{waittime} = (\mathit{servoh} + \mathit{comptime}) - ((n - 1) \times \mathit{sendtime})$$
$$\text{if } \mathit{waittime} < 0 \text{ then } \mathit{waittime} = 0$$

$$m = \mathit{pack} + (n \times \mathit{cloh}) + \mathit{waittime} + (n \times \mathit{clproc}) + (n \times \mathit{unpack})$$

If a null RPC2 call takes rpctime, the total time, r, to contact n servers using iterative RPC2 calls is given by

$$r = n \times (\mathit{rpctime} + \mathit{comptime})$$

The times for the individual components in our implementation were obtained by actual measurement and
are presented in Table III-2. Using these values in the expressions derived above we can calculate the
quantities r, m and r/m for server computation times of 10, 20 and 50 milliseconds. These predicted values
are presented in Table III-6 and shown graphically in Figure IV-4.
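The model is simple enough to evaluate directly, as the following sketch suggests. The component times in `ASSUMED` and `RPCTIME` are placeholder values in milliseconds chosen only to exercise the formulas; they are not the measurements of Table III-2.

```python
def multirpc_time(n, pack, cloh, servoh, comptime, clproc, unpack, sendtime):
    """Predicted elapsed time m for one MultiRPC call to n servers."""
    waittime = max(0.0, (servoh + comptime) - (n - 1) * sendtime)
    return pack + n * cloh + waittime + n * clproc + n * unpack

def iterative_time(n, rpctime, comptime):
    """Predicted elapsed time r for n consecutive simple RPC2 calls."""
    return n * (rpctime + comptime)

# Placeholder component times (ms); NOT the measured values.
ASSUMED = dict(pack=1.0, cloh=3.0, servoh=8.0, comptime=20.0,
               clproc=4.0, unpack=1.0, sendtime=3.0)
RPCTIME = 25.0

def speedup(n):
    """Ratio r/m for the placeholder parameters above."""
    m = multirpc_time(n, **ASSUMED)
    r = iterative_time(n, RPCTIME, ASSUMED["comptime"])
    return r / m

for n in (1, 5, 10, 20, 50):
    print(n, round(speedup(n), 2))
```

With these placeholder values waittime falls to zero once n exceeds about ten servers, after which m grows linearly in n and the speedup flattens.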

Most systems with parallelism initially exhibit linear speedup, then show sublinear speedup, and finally
saturate. Figure IV-4 shows that MultiRPC conforms to this expected behaviour. However, two detailed
observations are apparent from this graph. First, saturation occurs at surprisingly low levels of speedup.
Second, the level at which speedup saturates and the number of servers at which saturation sets in are both
dependent on the server computation time.

The server computation time is central to MultiRPC because most of the parallelism comes from overlap of
server computations. The sending of requests and processing of replies at the client are done sequentially.
The realisable speedup depends on how long these operations take in comparison to the time spent at the
server.


We can quantify this reasoning in the following way. Our measurements show that the dominant components
of MultiRPC are the time to send a request and the time to receive a reply. Suppose the sum of these
quantities, called systime, is identical in MultiRPC and RPC2. Further, let the total time spent at a server
(equal to the sum of servoh and comptime) be servtime. Then the MultiRPC and RPC2 call times to n servers
can be crudely approximated as follows:

$$ m = (n \times \mathit{systime}) + \mathit{servtime} $$

$$ r = (n \times \mathit{systime}) + (n \times \mathit{servtime}) $$


Writing T for the ratio servtime/systime, the speedup is

$$ \frac{r}{m} = \frac{\mathit{systime} + \mathit{servtime}}{\mathit{systime} + \mathit{servtime}/n} = \frac{1 + T}{1 + T/n} $$

The above expression clearly shows that the speedup is sensitive to the value of T. In the limit, as n tends to
infinity, the value of r/m tends to 1 + T. This accounts for the fact that the saturation value of the speedup is
higher for longer server computation times. Because the times to send and receive a packet dominate systime,
improvements to the underlying network primitives will improve the maximum speedup obtained with
MultiRPC. Consequently, an improved basic RPC mechanism would result in improved MultiRPC
performance rather than rendering it superfluous.
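The crude approximation and its limit can be checked numerically. The sketch below assumes T is the ratio servtime/systime, which is what the stated limit of 1 + T implies. The function names are illustrative only.

```c
/* Crude speedup r/m for n servers:
   r = n*systime + n*servtime (iterative RPC2), m = n*systime + servtime (MultiRPC). */
double crude_speedup(int n, double systime, double servtime)
{
    double r = n * systime + n * servtime;
    double m = n * systime + servtime;
    return r / m;
}

/* Saturation value of the speedup, 1 + T, with T = servtime / systime. */
double saturation_speedup(double systime, double servtime)
{
    return 1.0 + servtime / systime;
}
```

For a single server the two call times coincide and the speedup is 1. As n grows the speedup approaches, but never reaches, the saturation value. Halving systime while holding servtime fixed doubles T, which is why faster network primitives raise the ceiling.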


6.2.  Experimental  Results 

We performed a series of carefully controlled experiments to confirm our understanding of MultiRPC and to
explore its behaviour when contacting a large number of servers. We describe our experiments and the
observations from them in the next three sections.

6.2.1. Experimental Environment

The  experiments  were  conducted  in  an  environment  of  about  500  Sun3,  DEC  MicroVax,  and  IBM  RT-PC 
workstations  running  the  Unix  4.2BSD  operating  system  and  attached  to  the  Andrew  File  System.  For 
uniformity,  we  ran  our  tests  only  on  the  IBM  RT-PC  workstations,  with  one  of  the  workstations  being  the 
client  and  the  others  servers.  By  designing  our  tests  to  require  no  file  accesses  or  system  calls  on  the  servers 
we  avoided  distortions  of  our  measurements  by  distributed  file  system  access. 

The topology of the network connecting these workstations is shown in Appendix II. Although physically
compact*, there is considerable complexity in the network structure. There are about a dozen Ethernet and
IBM Token Ring subnets connected to each other directly or via optic fibre links. Active computing
elements, called routers, perform the appropriate forwarding or filtering of packets between these subnets. In
addition to the Andrew workstations, many standalone workstations and mainframe computers are also on
this network.

The scale and complexity of the network introduced serious problems in controlling our experiments. We
had to be on guard against extraneous network activity loading the network and the routers. We also had to
contend with the fact that closely-spaced replies to a MultiRPC call from a large number of servers could
overload the routers and affect our measurements.

To address these problems we separated our experiments into two classes. The tests for model validation,
discussed in Section 6.2.2, were run entirely on workstations located on a single subnet. For these tests we
were able to ensure that there was no other network activity. The presence of routers was irrelevant since the
client and all the servers were on the same subnet. Since there were only 20 workstations on this subnet, our
model validation is restricted to this range.

The  tests  discussed  in  Section  6.2.3  to  explore  the  large-scale  performance  of  MultiRPC  could  not  be 
controlled  so  well.  To  include  more  than  20  servers  our  tests  had  to  span  multiple  subnets.  We  minimised  the 
effects  of  extraneous  network  activity  by  running  our  experiments  in  the  early  hours  of  the  morning,  when  the 
network  was  least  active.  Preliminary  tests  confirmed  that  this  consistently  produced  smaller  variances  in  our 
measurements  than  tests  run  at  any  other  period  of  the  day. 

We simulated computation on the servers by delaying the reply to a request by a specified amount of time.
Unfortunately, the clock resolution of 16 milliseconds on the RT-PCs was inadequate for the range of
computation times of interest to us. We therefore had to resort to a timing loop to achieve delays of 10, 20
and 50 milliseconds in our tests. The clock resolution was also inadequate for measuring the elapsed time of
individual calls at the client. To overcome this, we timed many iterations of each call and used the average.

* The entire CMU campus is only about one square kilometer in size.
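The averaging technique can be sketched as follows. This is an illustration under modern assumptions, not the measurement code used in the experiments: it relies on the POSIX clock_gettime call with CLOCK_MONOTONIC, and null_call is a hypothetical stand-in for the remote call being timed.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical stand-in for the call being measured. */
static volatile unsigned long call_count;
static void null_call(void) { call_count++; }

/* Time `iterations` back-to-back calls and return the mean elapsed
   time per call in milliseconds. Spreading the measurement over many
   calls makes a coarse clock tick small relative to the total. */
double average_call_ms(void (*call)(void), long iterations)
{
    struct timespec start, end;
    double total_ms;
    long i;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        call();
    clock_gettime(CLOCK_MONOTONIC, &end);

    total_ms = (end.tv_sec - start.tv_sec) * 1000.0
             + (end.tv_nsec - start.tv_nsec) / 1.0e6;
    return total_ms / (double)iterations;
}
```

With a 16 millisecond clock tick, timing 1000 iterations bounds the error contributed by the clock to roughly 0.016 milliseconds per call.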

6.2.2.  Validation 

The worst performance of MultiRPC, relative to RPC2, occurs when a single site is contacted. In this
situation, the additional complexity of a MultiRPC call is not amortised over many connections. Table III-1
compares the elapsed times to contact a single site using RPC2 and MultiRPC, for zero server computation
time. The table shows that the difference in times is negligible. Further, the RPC2 time presented in the
table is no worse than the time observed on an earlier version of RPC2 that lacked MultiRPC support. Our
design criterion of not slowing down simple RPCs has thus been met.

To compute the model predictions we need to know the times taken by individual components of MultiRPC.
These values, obtained by standalone measurements of MultiRPC, are presented in Table III-2. By
substituting these values in the expressions derived in Section 6.1 we obtain the model predictions shown in
Table III-6 and Figure IV-4.

Figures IV-5, IV-6 and IV-7 compare the predictions of the model to our measurements. The model
consistently predicts slightly better performance than we actually observe, but the fit is surprisingly good
considering the simplicity of the model. For the reasons discussed earlier we were unable to investigate the
validity of the model beyond 20 servers.

6.2.3.  Large  Scale  Effects 

Since the original motivation for MultiRPC involved a scenario with a large number of workstations
distributed over the entire CMU campus, we were curious to see just how well MultiRPC behaves in such an
environment. Our analytical model is not valid in this situation because of the presence of routers and
extraneous network traffic. These factors affect both RPC2 and MultiRPC. Their effects show up as
anomalies in the average values of the measured quantities, and as high associated variances.

Table III-3 and Figure IV-1 present our observations for server computation times of 10, 20 and 50
milliseconds. The behaviour below about 25 servers is in accordance with our model. Beyond this the
speedup drops rather than increasing or remaining constant. Detailed examination of the data revealed an
increase in the number of retried packets beyond about 25 servers. Below 25 servers, retries never comprised
more than 0.4% of the total packets sent in a MultiRPC test. In fact, for those configurations there were often
no retried packets at all. Beyond 25 servers the number of retried packets increased, and sometimes
accounted for as much as 25 percent of the total traffic in configurations beyond 50 servers.

The measurements described above were performed with server computation times that were constants. We
conjectured that one of the main reasons for poor behaviour at large scale was the overloading of routers due
to simultaneous arrival of replies from many servers. Real server computations tend not to be constant. We
therefore repeated the experiments with computation times normally distributed, with a standard deviation
that was 10% of the mean. Table III-4 and Figure IV-2 show the corresponding results. We also repeated the
experiments with an exponentially distributed service time, to provide validation data for a stochastic model
that might be built in the future. This data is shown in Table III-5 and Figure IV-3. Unfortunately, as the
results from these experiments show, the aberrations in performance at large scale do not vanish when using a
non-constant service time distribution.
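One way to generate the normally distributed computation times is sketched below. The paper does not say how the delays were drawn, so the method is an assumption: it uses the classical approximation that sums twelve uniform deviates, chosen because it needs nothing beyond the standard C library. All function names are illustrative. The exponential case is not covered by this sketch.

```c
#include <stdlib.h>

/* Approximate standard normal deviate: the sum of 12 uniform values on
   [0,1) has mean 6 and variance 1, so subtracting 6 gives mean 0,
   variance 1. The result always lies in [-6, 6). */
double approx_std_normal(void)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < 12; i++)
        sum += rand() / ((double)RAND_MAX + 1.0);
    return sum - 6.0;
}

/* Simulated server computation time in milliseconds:
   mean `mean_ms`, standard deviation 10% of the mean. */
double normal_comp_time_ms(double mean_ms)
{
    return mean_ms + 0.10 * mean_ms * approx_std_normal();
}

/* Mean of `samples` simulated computation times. */
double sample_mean_ms(double mean_ms, int samples)
{
    double total = 0.0;
    int i;
    for (i = 0; i < samples; i++)
        total += normal_comp_time_ms(mean_ms);
    return total / samples;
}
```

Because the deviate is bounded, a delay drawn for a 20 millisecond mean always falls between 8 and 32 milliseconds, so no clamping against negative delays is needed.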

Although there is a decline in observed speedup in the large scale tests, it must be emphasised that MultiRPC
performance is always better than iterative RPC2 performance. Further, we encountered no functional
problems in using MultiRPC up to 100 servers. These facts give us confidence in the value of MultiRPC as a
basic component of large distributed systems.

7.  Related  Work 

In this section we look at a number of parallel RPC mechanisms with a view to placing the design of
MultiRPC in perspective. We make no attempt to be exhaustive in our examples or to be complete in our
descriptions. Rather, our goal is to examine these systems in a manner that highlights their similarities and
differences with respect to MultiRPC.

Work in the area of parallel network access has typically focused on broadcast or multicast protocols. Two
examples of such work are Sun Microsystems' Broadcast RPC (Sun-BRPC) [18], and Group Interprocess
Communication in the V Kernel (V-GIPC) [3].

Sun-BRPC depends upon IP-level broadcast to communicate with multiple sites. Servers must register
themselves in advance with a central port in order to be accessible via the broadcast facility. This is in contrast
to MultiRPC where no explicit actions need be taken by the server; existing servers do not have to be
modified, recompiled or relinked. Like MultiRPC, Sun-BRPC provides for a client handler routine and an
overall client-specified timeout. However, it does not provide the same correctness guarantees and error
reporting as MultiRPC.

V-GIPC uses the Ethernet multicast protocol as its basis and defines host groups as message addresses. Each
request is multicast and it is up to each host to recognise those group addresses for which it has local processes
as members. Reliable communication is not an objective of V-GIPC, even though its designers report that
lost responses due to simultaneous arrival of packets are common. Another difference is that there is no
notion of a client handler in V-GIPC. A client is blocked only until the first response is received; further
responses have to be explicitly gathered. When another call is made, the previous call is implicitly terminated
and further responses to it are discarded.

Neither Sun-BRPC nor V-GIPC allows long server computations at all sites in a parallel call while providing
timely notification of site or network failures. A sufficiently long computation would simply cause a timeout.
The MultiRPC retransmission protocol addresses this issue and allows the client to distinguish between a long
computation and permanent communication failure.

There are also parallel RPC systems which do not depend on broadcast or multicast. One such system is
Circus [4, 5], which focuses on achieving high availability by using parallel RPC as a vehicle for replication.
Circus is built on top of the DARPA IP/UDP protocol and is unique in that it supports many-to-many
communication rather than one-to-many. It provides for a fixed set of routines called collators, one of which
must be specified by the client when making a call. Collators perform a function similar to the MultiRPC
client handler routine, but differ in that they do not allow a call to be terminated prematurely. Like
MultiRPC, Circus uses probe packets to distinguish between long server computations and permanent
communication failure. Many problems addressed by Circus, such as orphan detection and exactly-once
semantics in the presence of failures, are unique to its intended application.

The Gemini parallel RPC mechanism [2, 11] is built on the reliable IP/TCP byte stream protocol [8], which
subsumes the retransmission, timeout, acknowledgement and probe functions of RPC. It is similar to
MultiRPC in that it requires no multicast or broadcast support. Unlike MultiRPC, a distinct Unix process is
created on a server for each client. The stub compiler for Gemini accepts interface specification in C rather
than defining a separate interface language. The equivalent of the MultiRPC client handler routine is a
language construct called a result statement. This is a compound statement in C that syntactically appears
after a parallel remote procedure call. This body of code is executed exactly once for each reply and the
execution can terminate the entire call prematurely.


8.  Conclusion 

The central message of this paper is that it is possible to build an efficient and easy to use parallel invocation
mechanism whose semantics is a natural extension of the remote procedure call paradigm. We have derived
an analytic model of this mechanism and shown that its predicted performance closely matches the measured
performance of our implementation. Asymptotic analysis of this model indicates that improvements to the
underlying transmission primitives would further strengthen the case for using MultiRPC in preference to
iterative RPCs. We have demonstrated experimentally that the mechanism works successfully for up to 100
servers in a single call, executing in a complex network environment with diverse transmission media and
interconnecting elements. Comparison with other parallel invocation mechanisms shows that MultiRPC is
unique in its overall design, although some of the individual concepts used in it may be found elsewhere.

There are a number of ways in which the work reported here may be extended. First, MultiRPC provides
parallelism  at  the  programming  interface  but  does  not  require  multicast  capability  in  any  of  the  lower  levels  of 
the  networking  software.  Suppose,  however,  multicast  were  available.  How  could  MultiRPC  use  it?  What 
would  its  performance  behaviour  be  then?  Our  view  is  that  multicast  is  a  performance  enhancement  rather 
than  a  fundamental  programming  primitive.  Preliminary  work  indicates  that  if  the  programmer  is  required  to 
explicitly  group  connections,  the  semantics  of  MultiRPC  can  be  preserved  while  using  multicast  internally. 
An  interesting  security  question  arises  in  this  context.  How  does  one  support  site-specific  encryption  on 
multicast  packets?  Our  approach  is  to  use  the  underlying  secure  RPC2  connections  as  key  distribution 
channels  for  internally  generated  group-specific  keys.  We  will  report  on  our  experience  with  this  approach 
and  further  details  on  the  use  of  multicast  in  a  later  paper. 

Another potential improvement to MultiRPC involves ARGs. RP2Gen does not take advantage of repeated
argument types when it generates ARGs; it creates a new ARG for each parameter of each remote operation.
For recursive structure arguments this can consume a significant amount of storage. Since structure
arguments are defined in the subsystem specification file, and hence known to RP2Gen, it should be possible
to share ARGs.

Finally, using software in a variety of applications often results in unexpected lessons and refinements. It is
impossible to predict in advance what such changes might be in the context of MultiRPC. However, we are
confident that MultiRPC and mechanisms similar to it will prove to be an important building block in
distributed systems.

I. An Example

This appendix presents a brief example in order to make some of the previous discussion more concrete.
Although this example is contrived, it is adequate to illustrate the structure of a client and server that
communicate via RPC2, and the changes that must be made to the client to use MultiRPC.

The subsystem designer defines a subsystem, chooses a name for it, and writes the specifications in the file
<subsystemname>.rpc2. This file is submitted to RP2Gen, which generates client and server stubs and a header
file. RP2Gen names its generated files using the file name of the .rpc2 file with the appropriate suffix.

Figure I-1 presents the specification of the subsystem example in the file example.rpc2. RP2Gen interprets
this specification and produces a client stub in the file example.client.c, a server stub in example.server.c and a
header file example.h (shown in Figure I-2). This subsystem is composed of two operations, double_it and
triple_it. These procedures both take a call by value-result parameter, testval, containing both the integer to be
operated on and the result returned by the server.

Once the interface has been specified, the subsystem implementor is responsible for exporting the subsystem
and writing the server main loop and the bodies of the procedures to perform the server operations. A client
wishing to use this server must first bind to it and then perform an RPC on that connection. Figures I-3 and
I-4 illustrate the client and server code. We wish to emphasise that this code is devoid of any considerations
relating to MultiRPC.

Now consider extending this example to contact multiple servers using MultiRPC. The example.rpc2 file and
the server code remain exactly the same. Argument Descriptor Structures (ARGs), used by MultiRPC to
marshall and unmarshall arguments, are already present in the client stub file; pointers to these structures are
defined in the example.h file. Only the client code has to be modified, as shown in Figures I-5 and I-6.

From the client's perspective, a MultiRPC call is slightly different from a simple RPC2 call. The procedure
invocation no longer has the syntax of a local procedure call. Instead, the single library routine MakeMulti is
used to access the runtime system routine MultiRPC. The client must allocate all necessary parameter
storage. IN arguments are simply supplied as for any procedure call, but for OUT and IN-OUT parameters,
arrays of pointers to the appropriate types must be supplied to the MakeMulti routine. The return
arguments from each of the servers will be placed in the appropriate elements of the arrays.
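The storage requirement for OUT and IN-OUT parameters can be illustrated in isolation. The sketch below is not part of the RPC2 library; alloc_result_slots and free_result_slots are hypothetical helpers that build and release the kind of pointer array the client passes in, with one zero-initialised integer slot per server.

```c
#include <stdlib.h>

/* Release every slot and clear the pointers. */
void free_result_slots(int *slots[], int howmany)
{
    int i;
    for (i = 0; i < howmany; i++) {
        free(slots[i]);
        slots[i] = NULL;
    }
}

/* Allocate one zero-initialised int per server. Returns 0 on success;
   on failure releases anything already allocated and returns -1. */
int alloc_result_slots(int *slots[], int howmany)
{
    int i;
    for (i = 0; i < howmany; i++)
        slots[i] = NULL;
    for (i = 0; i < howmany; i++) {
        slots[i] = (int *)calloc(1, sizeof(int));
        if (slots[i] == NULL) {
            free_result_slots(slots, howmany);
            return -1;
        }
    }
    return 0;
}
```

Note that resetting results between calls means writing through each pointer. Zeroing the array itself would discard the pointers and with them the storage the servers' replies are meant to land in.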

The client is also responsible for supplying a handler routine for any server operation which is used in a
MultiRPC call. The handler routine is called by MultiRPC as each individual server response arrives,
providing an opportunity to perform incremental bookkeeping and analysis. The return code from the
handler gives the client control over the continuation or termination of the MultiRPC call.

/* RP2Gen specification file for simple subsystem */

Server Prefix "serv";     /* tag server operations to avoid ambiguity */
Subsystem "Example";

double_it(IN OUT RPC2_Integer testval);
triple_it(IN OUT RPC2_Integer testval);


Figure I-1: example.rpc2 specification file




/* .h file produced by RP2GEN
   Input file: example.rpc2 */

#include "rpc2.h"
#include "se.h"

/* Op codes and definitions */

#define double_it_OP 1
extern ARG double_it_ARGS[];
#define double_it_PTR double_it_ARGS

#define triple_it_OP 2
extern ARG triple_it_ARGS[];
#define triple_it_PTR triple_it_ARGS

Figure I-2: The RP2Gen-generated header file example.h

/* Include relevant header files */

main()
{
    int testval, op;
    RPC2_Handle cid;

    /* Perform LWP and RPC2 initialisation */

    /* Establish connection to server using RPC2_Bind */
    RPC2_Bind(Bind arguments, &cid);

    while (TRUE)
    {
        printf("\nDouble [= 1] or Triple [= 2] (type 0 to quit)? ");
        scanf("%d", &op);
        if (op == 0)
            break;
        printf("Number? ");
        scanf("%d", &testval);

        /* Perform RPC2 call on connection cid */
        if (op == 1) double_it(cid, &testval);
        else triple_it(cid, &testval);
        printf("result = %d\n", testval);
    }

    RPC2_Unbind(cid);     /* Terminate connection with server */
    printf("Bye...\n");
}


Figure I-3: A Simple RPC2 Client


/* Include relevant header files */

main()
{
    RPC2_PacketBuffer *reqbuffer;
    RPC2_Handle cid;

    /* Perform LWP and RPC2 initialization */

    /* Set filter to accept requests on new or existing connections */

    /* Enter server loop */
    while (TRUE)
    {
        /* Await a client request */
        RPC2_GetRequest(&cid, &reqbuffer, Other Arguments);
        serv_ExecuteRequest(cid, reqbuffer);     /* Routine is generated by RP2Gen */
    }
}

/* Bodies of server procedures */
long serv_double_it(cid, testval)
    RPC2_Handle cid;
    RPC2_Integer *testval;
{
    *testval = (*testval) * 2;
    return(RPC2_SUCCESS);
}

long serv_triple_it(cid, testval)
    RPC2_Handle cid;
    RPC2_Integer *testval;
{
    *testval = (*testval) * 3;
    return(RPC2_SUCCESS);
}


Figure I-4: A Simple RPC2 Server

/* Include relevant header files */

#define HOWMANY 3
#define WAITFOR 2
long HandleDouble();
long HandleTriple();
long returns;

main()
{
    int *testval[HOWMANY], op, count;     /* Can use static or dynamic allocation */
    long ret;
    RPC2_Handle cid[HOWMANY];

    /* Perform LWP and RPC2 initialization */

    /* Establish connections to servers */
    for (count = 0; count < HOWMANY; count++)
    {
        ret = RPC2_Bind(Bind arguments, &cid[count]);
        testval[count] = (int *)malloc(sizeof(int));     /* allocate space for arguments */
    }

    while (TRUE)
    {
        printf("\nDouble [= 1] or Triple [= 2] (type 0 to quit)? ");
        scanf("%d", &op);
        if (op == 0) break;
        printf("\nNumber? ");
        scanf("%d", testval[0]);     /* IN argument goes in 1st array slot */

        /* Make the MultiRPC call */
        returns = 0;
        for (count = 1; count < HOWMANY; count++)
            *testval[count] = 0;     /* initialize results */
        if (op == 1)
            MakeMulti(double_it_OP, double_it_PTR, HOWMANY, cid,
                      HandleDouble, NULL, testval);
        else
            MakeMulti(triple_it_OP, triple_it_PTR, HOWMANY, cid,
                      HandleTriple, NULL, testval);
        for (count = 0; count < HOWMANY; count++)
            printf("result[%d] = %d\n", count, *testval[count]);
    }

    for (count = 0; count < HOWMANY; count++)
        RPC2_Unbind(cid[count]);
    printf("Bye...\n");
}


Figure I-5: Client using MultiRPC


long HandleDouble(HowMany, cidarray, host, rpcval, testval)
    RPC2_Integer HowMany, host, rpcval, *testval[];
    RPC2_Handle cidarray[];
{
    if (rpcval != RPC2_SUCCESS)
        printf("HandleDouble: rpcval = %d\n", rpcval);
    if (++returns == WAITFOR) return -1;     /* Terminate the MultiRPC call */
    return(0);     /* Continue accepting server responses */
}

long HandleTriple(HowMany, cidarray, host, rpcval, testval)
    RPC2_Integer HowMany, host, rpcval, *testval[];
    RPC2_Handle cidarray[];
{
    if (rpcval != RPC2_SUCCESS)
        printf("HandleTriple: rpcval = %d\n", rpcval);
    if (++returns == WAITFOR) return -1;     /* Terminate the MultiRPC call */
    return(0);     /* Continue accepting server responses */
}

Figure I-6: MultiRPC Client Handler Routines
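The effect of the handler's return code can be seen without the RPC2 runtime. The sketch below is a simulation under stated assumptions: deliver_replies is a hypothetical stand-in for the part of MultiRPC that invokes the handler once per arriving reply, and the handler signature is simplified from the one in Figure I-6. Only the convention taken from the figure is relied upon, namely that returning -1 terminates the call and returning 0 continues it.

```c
#define QUORUM 2   /* plays the role of WAITFOR in Figure I-5 */

typedef long (*reply_handler)(int how_many, int host, long rpcval);

static int replies_seen;

/* Stop the call as soon as QUORUM replies have arrived. */
long quorum_handler(int how_many, int host, long rpcval)
{
    (void)how_many; (void)host; (void)rpcval;
    if (++replies_seen == QUORUM)
        return -1;   /* terminate the call */
    return 0;        /* keep accepting replies */
}

/* Hypothetical dispatcher: hand each reply to the handler in arrival
   order and stop early if the handler asks for termination.
   Returns the number of replies actually delivered. */
int deliver_replies(int how_many, const long rpcvals[], reply_handler handler)
{
    int host, delivered = 0;
    for (host = 0; host < how_many; host++) {
        delivered++;
        if (handler(how_many, host, rpcvals[host]) == -1)
            break;
    }
    return delivered;
}
```

With three servers and a quorum of two, the third reply is never delivered to the handler. This is the behaviour the client in Figure I-5 obtains by setting WAITFOR to 2.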


II. Network Topology

Figure II-1: Carnegie Mellon Network Topology


All times are in milliseconds. Figures in parentheses are standard deviations. A subset of this data is graphically
presented in Figure IV-1 and is also used in Figures IV-5, IV-6 and IV-7. For configurations up to 20 servers, all machines
were located on a single isolated token ring. Configurations involving more than 20 servers spanned multiple network
segments. Each data point was obtained from 10 trials.

Table III-3: Measured Call Times for Constant Server Computation Times

10 ms Computation

Servers   MRPC                RPC2                 R/M
1         39.32 (1.11)        37.75 (0.95)         1.02
2         44.92 (1.29)        75.92 (158)          1.69
5         70.16 (3.25)        208.52 (3.74)        2.97
10        112.55 (3.97)       478.10 (16.40)       4.25
15        162.75 (18.40)      747.75 (21.09)       4.59
19        -                   -                    -
20        208.05 (7.10)       956.55 (35.47)       4.60
25        247.48 (1.19)       1153.52 (18.49)      4.66
50        780.00 (34.57)      2933.10 (124.90)     3.76
75        1474.30 (103.62)    4579.60 (57.35)      3.12
100       1715.89 (28.35)     6120.10 (61.95)      3.57

20 ms Computation

Servers   MRPC                RPC2                 R/M
1         50.40 (6.42)        47.60 (1.00)         0.94
2         59.56 (6.68)        99.88 (10.41)        1.68
5         79.32 (198)         261.68 (5.75)        3.30
10        112.08 (9.15)       530.00 (17.95)       4.73
15        148.16 (119)        783.36 (20.76)       5.29
19        -                   -                    -
20        197.44 (116)        1065.76 (21.33)      5.40
25        248.96 (5.14)       1430.00 (91.75)      5.74
50        77L79 (8.56)        3266.20 (56.84)      4.23
75        1238.20 (177.80)    5351.07 (239.53)     4.32
100       1637.30 (UL99)      7282.70 (403.44)     4.45

50 ms Computation

Servers   MRPC                RPC2                 R/M
1                             77.96 (7.92)         1.01
2         85.28 (7.74)        155.96 (12.06)       1.83
5         108.36 (12.48)      414.24 (31.02)       3.82
10        143.40 (10.01)      824.76 (19.72)       5.75
15        185.55 (14.50)      1270.85 (30.83)      6.85
19        224.00 (25.12)      1698.70 (42.18)      7.58
20        -                   -                    -
25        258.04 (13.50)      2179.00 (47.62)      8.44
50        838.20 (34.70)      4901.70 (186.16)     5.85
75        1581.50 (429.48)    7690.20 (277.17)     4.86
100       1943.60 (1965.90)   10082.40 (248.46)    5.19

All times are in milliseconds. Figures in parentheses are standard deviations. This data was obtained from servers
distributed over many network segments. A subset of this data is graphically presented in Figure IV-2. Each data point
was obtained from 10 trials.


Table III-4: Measured Call Times with Normally Distributed Server Computation Times

10 ms Computation

MRPC                RPC2                 R/M
37.68 (3.98)        36.80 (6.32)         0.98
45.63 (4.14)        72.80 (5.35)         1.53
72.92 (4.15)        206.48 (14.25)       2.83
111.20 (2.17)       439.65 (22.78)       3.95
162.90 (1.52)       626.70 (53.22)       3.85
211.95 (12.15)      843.95 (54.43)       3.98
-                   -                    -
768.20 (10.75)      2835.90 (170.28)     3.69
1358.11 (139.52)    4390.60 (94.70)      3.23
1675.63 (39.54)     6074.90 (118.74)     3.62

20 ms Computation and 50 ms Computation (surviving entries; row and column positions not recoverable)

1227.73 (I4«5l)
1586.40 (52.27)
I1S26S (108.17)
1381.30 (88.92)
3166.40 (146.35)
5272.80 (462.43)
7335.90 (419.85)

All times are in milliseconds. Figures in parentheses are standard deviations. This data was obtained from servers
distributed over many network segments. A subset of this data is graphically presented in Figure IV-3. Each data point
was obtained from 10 trials.

Table III-5: Measured Call Times with Exponentially Distributed Server Computation Times

          10 ms Computation             20 ms Computation             50 ms Computation
Servers   MRPC      RPC2      R/M       MRPC      RPC2      R/M       MRPC      RPC2      R/M
1         34.36     37.80     1.10      44.36     47.80     1.08      74.36     77.80     1.05
2         39.32     75.60     1.92      49.32     95.60     1.94      79.32     155.60    1.96
5         54.20     189.00    3.49      64.20     239.00    3.72      94.20     389.00    4.13
7         71.33     264.60    3.71      74.12     334.60    4.51      104.12    544.60    5.23
10        101.66    378.00    3.72      101.66    478.00    4.70      119.00    778.00    6.54
13        131.99    491.40    3.72      131.99    621.40    4.71      133.88    1011.40   7.55
15        152.21    567.00    3.73      152.21    717.00    4.71      152.21    1167.00   7.67
17        172.43    642.60    3.73      172.43    812.60    4.71      172.43    1322.60   7.67
19        192.65    718.20    3.73      192.65    908.20    4.71      192.65    1478.20   7.67
20        202.76    756.00    3.73      202.76    956.00    4.71      202.76    1556.00   7.67
25        253.31    945.00    3.73      253.31    1195.00   4.72      253.31    1945.00   7.68
50        506.06    1890.00   3.73      506.06    2390.00   4.72      506.06    3890.00   7.69
75        758.81    2835.00   3.74      758.81    3585.00   4.72      758.81    5835.00   7.69
100       1011.56   3780.00   3.74      1011.56   4780.00   4.73      1011.56   7780.00   7.69

All times are in milliseconds. The server computation time is assumed constant. This data is graphically presented in
Figure IV-4, and used in Figures IV-5, IV-6 and IV-7.

Table III-6: Predicted Performance from Analytic Model

IV. Graphs

The data in this graph is obtained from Table III-3.

Figure IV-1: Measured Performance for Constant Server Computation Time

The data in this graph is obtained from Table III-4.

Figure IV-2: Measured Performance for Normal Server Computation Time

The data in this graph is obtained from Table III-5.

Figure IV-3: Measured Performance for Exponential Server Computation Time

The data in this graph is obtained from Table III-6.

Figure IV-4: Predicted Performance from Model (Constant Computation Time)

The predicted data is from Table III-6. The measured data is from Table III-3.

Figure IV-5: Comparison of Predicted and Measured Performance (10 ms Computation)

The predicted data is from Table III-6. The measured data is from Table III-3.

Figure IV-6: Comparison of Predicted and Measured Performance (20 ms Computation)

The predicted data is from Table III-6. The measured data is from Table III-3.

Figure IV-7: Comparison of Predicted and Measured Performance (50 ms Computation)